ABSTRACT

Blocks of video images are characterized as being part of
either scene foreground or background for encoding. The
foreground/background segmentation analysis involves a
pixel level and a block level. During the pixel level,
interframe differences corresponding to each original image are
thresholded to generate an initial pixel-level mask. A first
morphological filter is applied to the initial pixel-level mask
to generate a filtered pixel-level mask. During the block
level, the filtered pixel-level mask is thresholded to generate
an initial block-level mask. A second morphological filter is
preferably applied to the initial block-level mask to generate
a filtered block-level mask. Each element of the filtered
block-level mask indicates whether the corresponding block
of the original image is part of the foreground or background.
In a preferred embodiment, both morphological
filters filter out isolated mask elements.

18 Claims, 12 Drawing Sheets 
FIG. 4. GAIN CORRECTION
FIG. 12. FOREGROUND/BACKGROUND SEGMENTATION

ENCODING VIDEO IMAGES USING
FOREGROUND/BACKGROUND
SEGMENTATION

This application is a continuation of U.S. patent application
Ser. No. 08/536,981, filed on Sep. 29, 1995, now
abandoned.

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The present invention relates to image processing, and, in 
particular, to encoding video images. 

2. Description of the Related Art 

In teleconferencing applications, video sequences typically
consist of two distinct layers: a background layer and
a foreground layer. The background layer consists of the
static objects in the scene that ideally should be coded and
sent to the receiver only once. Conversely, the foreground
layer consists of objects that move and change shape as time
progresses. By concentrating bit allocation on pixels in the
foreground layer, more efficient video encoding can be
achieved. To achieve this goal, some video coders perform
foreground/background segmentation to determine which
portions of the video images correspond to foreground and
which to background. In general, background regions
correspond to portions of the scene that do not significantly
change from frame to frame.

Accurate foreground/background segmentation can be 
thwarted when the video images are generated by a video 
camera that performs automatic gain control (AGC). AGC is 
performed to ensure that the subject (i.e., a foreground 
object) falls well within the dynamic range of the camera. 
Unfortunately, AGC causes interframe differences to occur 
in regions that are spatially static (e.g., background regions). 
This can result in undesirable increases in the bitrate. It can 
also lead to misidentification of background regions as being 
part of the foreground. 

What is needed is a video encoding scheme that addresses
the bitrate and foreground/background segmentation problems
created by using video cameras with automatic gain
control.

It is accordingly an object of this invention to overcome 
the disadvantages and drawbacks of the known art and to 
provide an improved scheme for encoding video streams 
generated by video cameras operating with automatic gain 
control. 

Further objects and advantages of this invention will 
become apparent from the detailed description of a preferred 
embodiment which follows. 

SUMMARY OF THE INVENTION 

The present invention comprises a computer-implemented
process, an apparatus, and a storage medium
encoded with machine-readable computer program code for
encoding images. According to a preferred embodiment,
interframe differences for an original image are thresholded
to generate an initial pixel-level mask. A first morphological
filter is applied to the initial pixel-level mask to generate a
filtered pixel-level mask. The filtered pixel-level mask is
thresholded to generate an initial block-level mask. The
image is encoded based on the initial block-level mask.

BRIEF DESCRIPTION OF THE DRAWINGS 

Other objects, features, and advantages of the present
invention will become more fully apparent from the following
detailed description of the preferred embodiment, the
appended claims, and the accompanying drawings in which:

FIG. 1 is a block diagram of a video system for encoding
video signals in a PC environment, according to a preferred
embodiment of the present invention;

FIG. 2 is a computer system for decoding the video
signals encoded by the computer system of FIG. 1, according
to a preferred embodiment of the present invention;

FIG. 3 is a block diagram of a system for correcting gain,
according to a preferred embodiment of the present invention;

FIG. 4 is a flow diagram of the processing performed by
the gain-correction system of FIG. 3;

FIG. 5 is an example of an original image;

FIG. 6 is an initial pixel-level mask corresponding to FIG. 5;

FIG. 7 is a filtered pixel-level mask corresponding to FIG. 6;

FIG. 8 is another example of a filtered pixel-level mask;

FIG. 9 is an initial block-level mask corresponding to FIG. 8;

FIG. 10 is a filtered block-level mask corresponding to FIG. 9;

FIG. 11 is a block diagram of a system for performing
foreground/background segmentation, according to a
preferred embodiment of the present invention; and

FIG. 12 is a flow diagram of the processing implemented
by the foreground/background segmentation system of FIG. 11.

DESCRIPTION OF THE PREFERRED
EMBODIMENT(S)

The present invention is directed to video encoding systems
that correct for the gain associated with video cameras
that perform automatic gain control. The gain-corrected
images are then analyzed to identify blocks that correspond
to scene foreground and those that correspond to scene
background. This foreground/background segmentation
may be used to determine how to encode the image. The
segmentation results may also be used during the gain
correction processing of subsequent video frames.

System Hardware Architectures

Referring now to FIG. 1, there is shown a computer
system 100 for encoding video signals, according to a
preferred embodiment of the present invention.
Analog-to-digital (A/D) converter 102 of encoding system 100 receives
analog video signals from a video source. The video source
may be any suitable source of analog video signals such as
a video camera or VCR for generating local analog video
signals or a video cable or antenna for receiving analog
video signals from a remote source. A/D converter 102
decodes (i.e., separates the signal into constituent
components) and digitizes the analog video signals into
digital video component signals (e.g., in one embodiment,
8-bit R, G, and B component signals).

Capture processor 104 captures the digitized component
signals received from converter 102. Capturing may include
one or more of color conversion (e.g., RGB to YUV),
scaling, and subsampling. Each captured video frame is
represented by a set of three two-dimensional component
planes, one for each component of the digitized video
signals. In one embodiment, capture processor 104 captures
video signals in a YUV9 (i.e., YUV 4:1:1) format, in which
every (4x4) block of pixels of the Y-component plane
corresponds to a single pixel in the U-component plane and
a single pixel in the V-component plane. Capture processor
104 selectively stores the captured signals to memory device
112 and/or mass storage device 120 via system bus 114.
Those skilled in the art will understand that, for real-time
encoding, the captured signals are preferably stored to
memory device 112, while for non-real-time encoding, the
captured signals are preferably stored to mass storage device
120.

During real-time encoding, host processor 116 reads the
captured bitmaps from memory device 112 via high-speed
memory interface 110 and generates encoded video signals
that represent the captured video signals. Depending upon
the particular encoding scheme implemented, host processor
116 applies a sequence of compression steps to reduce the
amount of data used to represent the information in the
video signals. The encoded video signals are then stored to
memory device 112 via memory interface 110 and/or mass
storage device 120 via system bus 114. Host processor 116
may copy the encoded video signals to mass storage device
120 and/or transmit the encoded video signals to transmitter
118 for real-time transmission to a remote receiver (not
shown in FIG. 1).

Referring now to FIG. 2, there is shown a computer
system 200 for decoding the video signals encoded by
encoding system 100 of FIG. 1, according to a preferred
embodiment of the present invention. Encoded video signals
are either read from mass storage device 212 of decoding
system 200 or received by receiver 210 from a remote
transmitter such as transmitter 118 of FIG. 1. The encoded
video signals are stored to memory device 214 via system
bus 206.

Host processor 208 accesses the encoded signals stored in
memory device 214 via high-speed memory interface 216
and decodes the encoded video signals for display. Decoding
the encoded video signals involves undoing the compression
processing implemented by encoding system 100 of FIG. 1.
Host processor 208 stores the decoded video signals to
memory device 214 via memory interface 216 from where
they are transmitted to display processor 202 via system bus
206. Alternatively, host processor 208 transmits the decoded
video signals directly to display processor 202 via system
bus 206. Display processor 202 processes the decoded video
signals for display on monitor 204. The processing of
display processor 202 includes digital-to-analog conversion
of the decoded video signals. After being decoded by host
processor 208 but before being D/A converted by display
processor 202, the decoded video signals may be upsampled
(e.g., from YUV9 to YUV24), scaled, and/or color converted
(e.g., from YUV24 to RGB24). Depending upon the particular
embodiment, each of these processing steps may be
implemented by either host processor 208 or display
processor 202.

Referring again to FIG. 1, encoding system 100 is preferably
a microprocessor-based personal computer (PC) system
with a special-purpose video-processing plug-in board.
In particular, A/D converter 102 may be any suitable means
for decoding and digitizing analog video signals. Capture
processor 104 may be any suitable processor for capturing
digitized video component signals as subsampled frames. In
a preferred embodiment, A/D converter 102 and capture
processor 104 are contained in a single plug-in board
capable of being added to a microprocessor-based PC system.

Host processor 116 may be any suitable means for controlling
the operations of the special-purpose video processing
board and for performing video encoding. Host processor
116 is preferably an Intel® general-purpose
microprocessor such as an Intel® i486™, Pentium™, or
higher processor. System bus 114 may be any suitable digital
signal transfer device and is preferably a Peripheral Component
Interconnect (PCI) bus. Memory device 112 may be
any suitable computer memory device and is preferably one
or more dynamic random access memory (DRAM) devices.
High-speed memory interface 110 may be any suitable
means for interfacing between memory device 112 and host
processor 116. Mass storage device 120 may be any suitable
means for storing digital signals and is preferably a computer
hard drive. Transmitter 118 may be any suitable means
for transmitting digital signals to a remote receiver. Those
skilled in the art will understand that the encoded video
signals may be transmitted using any suitable means of
transmission such as telephone line, RF antenna, local area
network, or wide area network.

Referring again to FIG. 2, decoding system 200 is preferably
a microprocessor-based PC system similar to the
basic PC system of encoding system 100. In particular, host
processor 208 may be any suitable means for decoding
encoded video signals and is preferably an Intel®
general-purpose microprocessor such as an Intel® i486™,
Pentium™, or higher processor. System bus 206 may be any
suitable digital signal transfer device and is preferably a PCI
bus. Mass storage device 212 may be any suitable means for
storing digital signals and is preferably a CD-ROM device
or a hard drive. Receiver 210 may be any suitable means for
receiving the digital signals transmitted by transmitter 118 of
encoding system 100. Display processor 202 may be any
suitable device for processing video signals for display
(including converting the digital video signals to analog
video signals) and is preferably implemented through a
PC-based display system such as a VGA or SVGA system.
Monitor 204 may be any means for displaying analog
signals and is preferably a VGA monitor.

In a preferred embodiment, encoding system 100 of FIG.
1 and decoding system 200 of FIG. 2 are two distinct
computer systems. In an alternative preferred embodiment
of the present invention, a single computer system comprising
all of the different components of systems 100 and 200
may be used to encode and decode video signals. Those
skilled in the art will understand that such a combined system
may be used to display decoded video signals in
real-time to monitor the capture and encoding of video
signals.

In alternative embodiments of the present invention, the
video encode processing of an encoding system and/or the
video decode processing of a decoding system may be
assisted by a pixel processor such as an Intel® i750PE™
processor or other suitable component(s) to off-load processing
from the host processor by performing computationally
intensive operations.

Gain Correction

Background regions are typically defined as those regions
with relatively small interframe differences from frame to
frame. Automatic gain control presents problems for
foreground/background segmentation. A slight change in
gain may produce a relatively large amount of energy in the
difference image, which may lead to incorrect classification
of a large area of an image as foreground. Since a goal of the
present invention is to code only those areas of the scene
composed of foreground objects, misclassifying spatially
static regions as foreground would have an adverse effect on
achieving that goal.
To address the problems caused by automatic gain
control, the present invention preferably performs gain
correction. Gain correction involves two general steps: (1)
characterizing the gain associated with the current image
and (2) correcting for that estimated gain.

The gain for the current image is characterized by comparing
pixels of the current image that are part of the image
background to a set of reference pixels. In a preferred
embodiment of the present invention, a particular (16x16)
macroblock of the image is selected for use in characterizing
gain. For example, in video conferencing, where the paradigm
is a "talking head" on a static background, a macroblock
in the upper left corner of the frame may be initially
selected as being part of the image background for use in
characterizing gain.

The set of reference pixels (i.e., the reference macroblock)
is preferably generated from the corresponding macroblocks
of the most recent frames. For example, the reference
macroblock may be generated by averaging the corresponding
pixels from the n previous frames. That is, pixel (0,0) of
the reference macroblock is the average of the (0,0) pixels
from the corresponding macroblocks of each of the n
previous frames. In this embodiment, the n previous frames
are buffered for use in updating the reference macroblock for
each new frame. Alternatively, the reference macroblock can
be updated without retaining the actual frame data for the n
previous frames. For example, the pixels of the reference
macroblock may be updated according to the following
formula:

    g_{i+1} = \frac{n-1}{n} g_i + \frac{1}{n} f_i        (1)

where n is the number of frames used to generate the
reference macroblock, g_{i+1} is a pixel of the reference
macroblock for the next frame, g_i is the corresponding pixel of
the reference macroblock for the current frame, and f_i is the
corresponding pixel of the current frame. The reference
macroblock may be generated using other techniques as
well, e.g., the median of the last n frames.

If it is determined (e.g., during foreground/background
segmentation) that the reference macroblock does not
correspond to the background regions of the frame, then another
macroblock (e.g., the upper right corner of the frame) may
be selected for use in generating the reference macroblock.

In one embodiment, the gain factor α for the current frame
is characterized using the following equation:

    \alpha = \frac{\sum_x g[x] f[x]}{\sum_x f^2[x]}        (2)

where:
o g[x] are the pixels of the reference macroblock; and
o f[x] are the pixels of the macroblock of the current
frame corresponding to the reference macroblock.

In another embodiment, the gain factor α is estimated using
the following equation:

    \alpha = \frac{\sum_x g[x]}{\sum_x f[x]}        (3)

Equation (2) gives the minimum mean square error (MMSE)
estimate of the gain, while Equation (3) calculates the gain
with parameters that may already be known in the encoder.
In any case, since no multiplications are involved, Equation
(3) is computationally simpler than Equation (2).

Background areas that are saturated high are preferably
not used to calculate the gain factor α. Likewise, areas with
very low illumination levels (i.e., saturated low) will tend to
underestimate the gain because of the inherent quantization
involved in producing a digital image. As such, background
pixels that do not fall within specified upper and lower
threshold values are preferably not used to generate the gain
factor α.

Estimation of α using either Equation (2) or (3) requires
knowledge of where the background is located in the current
frame. Thus, gain correction and background detection are
dependent functions. A priori knowledge about the scene or
prediction based on past statistics is used to estimate initially
either α or the background areas. Blocks near the border of
the image that have remained part of the background for a
number of previous frames can serve as an adequate starting
point for estimating α.

After the gain factor α is estimated, the current image is
corrected for gain by multiplying the pixels of the current
image by α. In a preferred embodiment, only those pixels
that are part of background regions are corrected for gain.
Moreover, gain correction is not applied in saturated regions
(either high or low) of the background.

Referring now to FIG. 3, there is shown a block diagram
of a system for correcting gain, according to a preferred
embodiment of the present invention. Reference macroblock
processor 302 uses the previous reference macroblock and
the previous image to generate an updated (or new)
reference macroblock. Gain characterizer 304 uses the updated
reference macroblock and the current image to characterize
the gain associated with the current image. Gain corrector
306 uses the characterized gain to apply gain correction to
the appropriate pixels of the current image to generate a
gain-corrected current image.

Referring now to FIG. 4, there is shown a flow diagram
of the processing performed by the gain-correction system
of FIG. 3, according to a preferred embodiment of the
present invention. If the reference macroblock continues to
correspond to an unsaturated background region of the scene
(step 402 of FIG. 4), then the updated reference macroblock
is generated from the previous reference macroblock and the
previous frame (step 404). Otherwise, the reference
macroblock no longer corresponds to a region that may be used
to characterize gain. In that case, a new unsaturated
background macroblock is selected and used to generate a new
reference macroblock (step 406).

After the reference macroblock has been either updated or
generated anew, the reference macroblock and the
corresponding macroblock of the current frame are used to
characterize the gain associated with the current frame,
using either Equation (2) or (3) (step 408).

After the gain has been characterized, steps 410 and 416
combine to sequentially select all of the pixels of the current
frame. If the current pixel is part of an unsaturated
background region (step 412), then gain correction is applied
(step 414). Otherwise, the current pixel is either a saturated
pixel (either high or low) or part of a foreground region or
both. In those cases, gain correction is not applied and the
pixel retains its original value.

By correcting for the effects of automatic gain control, the
present invention provides robust segmentation of an image
into foreground/background regions. This gain correction
also increases the likelihood that motion estimates
correspond to the true motion in the scene. The present invention
attempts to normalize camera gain in the background and
acquire a nearly complete estimate of the background layer
over time. This background estimate can then be used to
segment each layer into foreground/background regions.
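
The gain-correction steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: macroblocks and frames are flat lists of 8-bit pixel values, the saturation limits low=16 and high=239 are hypothetical, the fallback gain of 1.0 when no usable pixels remain is an assumption, and all function names are invented for the example.

```python
def update_reference(ref, cur, n):
    """Running-average update: g_next = ((n - 1) / n) * g + (1 / n) * f."""
    return [((n - 1) * g + f) / n for g, f in zip(ref, cur)]


def _unsaturated_pairs(ref, cur, low, high):
    # Keep only pixel pairs whose current value lies inside [low, high];
    # the limits are hypothetical stand-ins for the "specified upper and
    # lower threshold values" mentioned in the text.
    return [(g, f) for g, f in zip(ref, cur) if low <= f <= high]


def gain_mmse(ref, cur, low=16, high=239):
    """MMSE gain estimate: sum(g * f) / sum(f * f)."""
    pairs = _unsaturated_pairs(ref, cur, low, high)
    den = sum(f * f for _, f in pairs)
    return sum(g * f for g, f in pairs) / den if den else 1.0


def gain_ratio(ref, cur, low=16, high=239):
    """Simpler gain estimate: sum(g) / sum(f)."""
    pairs = _unsaturated_pairs(ref, cur, low, high)
    den = sum(f for _, f in pairs)
    return sum(g for g, _ in pairs) / den if den else 1.0


def correct_gain(frame, is_background, alpha, low=16, high=239):
    """Scale unsaturated background pixels by alpha; leave the rest as-is."""
    out = []
    for value, bg in zip(frame, is_background):
        if bg and low <= value <= high:
            # Rounding and clamping to 255 are assumptions of this sketch.
            value = min(255, round(alpha * value))
        out.append(value)
    return out
```

For a current macroblock that is uniformly 20% darker than the reference, both estimators return a gain of 1.25, and only unsaturated background pixels are rescaled by `correct_gain`.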

Foreground/Background Segmentation

After correcting for gain, foreground/background segmentation
is performed to identify foreground and background
regions of the current image. Segmentation analysis
can be performed at different resolutions (i.e., granularities).
Pixel-based segmentation has the advantage of following the
boundaries of foreground objects more closely than
block-based segmentation. Some disadvantages of simple
pixel-based techniques are that connectivity is not encouraged and
that they do not fit easily into the framework of a block-based
coding algorithm. The connectivity problem can be
addressed by incorporating information from neighboring
pixels into the classification process at each pixel. In a
preferred embodiment of the present invention, the ultimate
goal is to incorporate the segmentation information into a
block-based compression scheme. Thus, the preferred
segmentation process results in a block-wise separation of each
image into foreground/background regions.

The segmentation analysis of the present invention has a
pixel level and a block level. At the pixel level, pixel
differences between the current frame and a reference frame
are thresholded for each frame to yield a pixel mask
indicating changed pixels. The block level takes the pixel-level
results and classifies blocks of pixels as foreground or
background. The basis for classification is the assumption
that significantly changed pixels should occur only in the
foreground objects.
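
As a rough illustration of the block-level step, the sketch below counts the changed pixels in each (n x n) block of a binary pixel mask and marks the block as foreground when the count exceeds a threshold. The function name, the list-of-lists mask representation, and the handling of partial blocks at the image edge are assumptions made for this example.

```python
def block_mask(pixel_mask, n, threshold):
    """Classify each n-by-n block of a binary pixel mask.

    A block is foreground (1) when it contains more than `threshold`
    changed pixels, otherwise background (0). Partial blocks at the
    right and bottom edges are counted as they are (an assumption).
    """
    rows, cols = len(pixel_mask), len(pixel_mask[0])
    result = []
    for by in range(0, rows, n):
        row = []
        for bx in range(0, cols, n):
            count = sum(
                pixel_mask[y][x]
                for y in range(by, min(by + n, rows))
                for x in range(bx, min(bx + n, cols))
            )
            row.append(1 if count > threshold else 0)
        result.append(row)
    return result
```

With 2x2 blocks and a threshold of 1, a block holding three changed pixels becomes foreground, while a block holding a single changed pixel stays background.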

Pixel Level

Thresholding is used to identify changed pixels in each
image plane. In one embodiment, a threshold is generated by
considering a maximum likelihood estimate of changed
regions. Every pixel in each image belongs to one of the
following two sets: H0 (background pixels) and H1
(non-background or foreground pixels). For each location j, let
pixel difference d_j = p_j - b_j, where p_j is the pixel value at
position j and b_j is the reference value at position j. The
reference values b_j are part of a reference frame. The
reference frame is preferably generated from the previous n
frames using the same technique employed in generating the
reference macroblock for gain correction, as described in the
previous section.

For pixels in set H0, d_j is expected to be a zero-mean
Gaussian-distributed random variable. For pixels in set H1,
p_j and b_j are assumed to be independent random variables
uniformly distributed between 0 and 255. These assumptions
yield the following equations:



[Equations (4) and (5) are not legible in the source text.]

where p(d_j | j ∈ H0) is the probability that d_j can take on a
certain value given that the pixel at location j is part of the
background and p(d_j | j ∈ H1) is the probability that d_j can take
on a certain value given that the pixel at location j is part of
the foreground.

Simplifying the likelihood ratio for these distributions
yields the following equation:

    |d_j| \geq \sqrt{ 2\sigma^2 \log\left[ \frac{255}{\sqrt{2\pi}\,\sigma} \cdot \frac{\Pr(j \in H0)}{1 - \Pr(j \in H0)} \right] }        (6)

where Pr(j ∈ H0) is the probability that the pixel at location j
is part of the background. If σ is selected to be 3 pixel
intensity levels and if Pr(j ∈ H0) is assumed to be 0.5, then
Equation (6) reduces to the following relation:

    |d_j| \geq 8        (7)

That is, if the magnitude of the pixel difference d_j for the
pixel at location j has a value of 8 or more, then the pixel is
said to be part of the foreground. Otherwise, the pixel is said
to be part of the background.
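
The threshold can be checked numerically with a short sketch. It assumes a zero-mean Gaussian with standard deviation sigma for background differences and approximates the foreground difference density as flat at 1/255; both the approximation and the function names are assumptions of this example rather than details confirmed by the text. With sigma = 3 and a background probability of 0.5 the result is about 7.96, in line with the threshold of 8 stated above.

```python
import math


def pixel_threshold(sigma, prior_bg):
    """Likelihood-ratio threshold on |d_j| (hedged reconstruction).

    sigma    -- standard deviation of background pixel differences
    prior_bg -- assumed probability that a pixel is background, in (0, 1)
    """
    if not 0.0 < prior_bg < 1.0:
        raise ValueError("prior_bg must lie strictly between 0 and 1")
    ratio = 255.0 * prior_bg / (math.sqrt(2.0 * math.pi) * sigma * (1.0 - prior_bg))
    if ratio <= 1.0:
        # The foreground hypothesis wins for every difference value.
        return 0.0
    return math.sqrt(2.0 * sigma ** 2 * math.log(ratio))


def is_foreground(pixel, reference, threshold):
    """A pixel is foreground when |pixel - reference| reaches the threshold."""
    return abs(pixel - reference) >= threshold
```

Raising the background probability raises the threshold, which matches the intuition that a scene believed to be mostly static should need stronger evidence before a pixel is declared changed.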

The choice for Pr(j ∈ H0) depends on how the background
pixels b_j are selected. In one embodiment, b_j are the pixels
of the previous frame. In this case, Pr(j ∈ H0) would be close
to 1. In another embodiment, a good background estimate is
used and Pr(j ∈ H0) is closer to 0.5 in typical video sequences.
The threshold value generated using Equation (6) may be
made temporally adaptive by updating the choice for
Pr(j ∈ H0) based on the foreground/background segmentation
results for the previous frames. For example, the number of
blocks identified as background in the previous frame
relative to the total number of blocks per frame could be used
as an estimate of the probability that a pixel of the current
frame is a part of the background.
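
One way to sketch this adaptive estimate is to take the fraction of background blocks in the previous frame's block-level mask. The clipping limits (0.05 and 0.95) and the default of 0.5 for an empty mask are assumptions added here so that the estimate never reaches exactly 0 or 1; they do not come from the text.

```python
def background_prior(prev_block_mask, floor=0.05, ceiling=0.95):
    """Estimate Pr(j in H0) from the previous frame's block-level mask.

    prev_block_mask is a list of rows holding 0 (background) or
    1 (foreground). The result is clipped to [floor, ceiling].
    """
    total = sum(len(row) for row in prev_block_mask)
    if total == 0:
        return 0.5  # no history yet: fall back to an uninformative prior
    background = sum(row.count(0) for row in prev_block_mask)
    return min(ceiling, max(floor, background / total))
```

The returned value can then be supplied as the background probability when the pixel threshold for the next frame is computed.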

A threshold is computed for each component plane in the
image and the pixel differences for each component plane
are thresholded using the corresponding threshold value. An
initial pixel-level mask is formed by ORing the thresholded
planes. The initial pixel-level mask is a binary mask having
a one-to-one correspondence between the mask elements
and the pixels of the original image. A mask element is 1 if
any of the pixel differences for the components of the
corresponding image pixel are greater than the corresponding
thresholds. Otherwise, the mask element is 0.
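
A minimal sketch of this step follows. It assumes that all component planes have already been brought to the same resolution (which is not the case for raw subsampled chroma planes), represents each plane as a list of rows, and uses a strict "greater than" comparison to follow the wording of this paragraph. The function name is hypothetical.

```python
def initial_pixel_mask(cur_planes, ref_planes, thresholds):
    """OR together the per-plane thresholded absolute differences.

    cur_planes, ref_planes -- sequences of equally sized 2-D planes
    thresholds             -- one threshold per plane
    """
    rows, cols = len(cur_planes[0]), len(cur_planes[0][0])
    mask = [[0] * cols for _ in range(rows)]
    for cur, ref, limit in zip(cur_planes, ref_planes, thresholds):
        for y in range(rows):
            for x in range(cols):
                if abs(cur[y][x] - ref[y][x]) > limit:
                    mask[y][x] = 1  # changed in at least one plane
    return mask
```

A pixel is flagged as changed as soon as any one of its components differs from the reference by more than that component's threshold.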

After the initial pixel-level mask is generated, a morphological
filter is applied to decrease false foreground detections,
which tend to occur along stationary edges. If M_p is the
initial pixel-level mask, then a preferred morphological filter
is given by the following equation:

    M'_p = \left[ (M_p * h) \geq 4 \right]        (8)

wherein M'_p is the filtered mask, "*" denotes convolution,
and:

    h = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 2 & 1 \\ 0 & 1 & 0 \end{bmatrix}        (9)
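
The filter can be sketched as a small convolution followed by a comparison against 4, using the kernel h given above. Treating pixels outside the mask as 0 (zero padding) is an assumption of this sketch, as are the names; because h is symmetric, convolution and correlation give the same result.

```python
KERNEL = ((0, 1, 0),
          (1, 2, 1),
          (0, 1, 0))


def filter_pixel_mask(mask, threshold=4):
    """Apply the 3x3 kernel to a binary mask and re-threshold.

    An output element is 1 when the weighted sum of the element and
    its four direct neighbours is at least `threshold`.
    """
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            total = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < rows and 0 <= xx < cols:
                        total += KERNEL[dy + 1][dx + 1] * mask[yy][xx]
            out[y][x] = 1 if total >= threshold else 0
    return out
```

An isolated foreground pixel scores only 2 and is cleared, while an isolated background pixel whose four direct neighbours are foreground scores 4 and is set.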



According to Equation (8), if the result of applying matrix
h to a (3x3) portion of the initial pixel-level mask M_p is greater
than or equal to 4, then the corresponding filtered element of
the filtered pixel-level mask M'_p is set to 1 to indicate that the
filtered element is part of the foreground. Otherwise, the
corresponding filtered element in the filtered pixel-level
mask is set to 0 to indicate that the filtered element is part


5,915,044

of the background. The morphological filter of Equation (8)
forces isolated foreground pixels to the background and
isolated background pixels to the foreground.

Referring now to FIGS. 5, 6, and 7, there are shown,
respectively, an example of an original image, an initial
pixel-level mask generated by thresholding the original
image of FIG. 5, and a filtered pixel-level mask generated by
applying the morphological filter of Equation (8) to the
initial pixel-level mask of FIG. 6.

Block Level

At the block level, each block of elements of the filtered
pixel-level mask is thresholded to determine whether the
block corresponds to a foreground block or a background
block. This thresholding step involves adding up the number
of elements of the block of the filtered pixel-level mask that
correspond to the foreground (i.e., have a value of 1) and
then comparing that sum to a specified threshold. If the
number of foreground elements in the block is greater than
the specified threshold, then the block is said to be a
foreground block. Otherwise, the block is said to be a
background block. The result of this thresholding step is an
initial block-level mask. Each element of the initial
block-level mask corresponds to a block of elements of the filtered
pixel-level mask and therefore to a block of pixels of the
original image. The initial block-level mask is a binary
mask, such that an element of the initial block-level mask
having a value of 1 corresponds to a foreground block, while
an element having a value of 0 corresponds to a background
block. Experimental results indicate that, for an (NxN)
block, the threshold value should lie between N/4 and N.

At the block level, it is also desirable to have a solid
foreground mask. Unfortunately, when the background is
not precisely known, holes tend to occur in the interior of
slowly moving smooth foreground objects. To reduce the
number of holes in the foreground, a morphological operator
is applied to the initial block-level mask. For an initial
block-level mask denoted M_b, a preferred morphological
operator is described by the following equation:

[Equation (10) is not legible in the source text.]

where M'_b is the filtered block-level mask, "*" denotes
convolution, U designates the "union" or "OR" operation,

[Equations (11) and (12) are not legible in the source text.]

Referring now to FIGS. 8-10, there are shown,
respectively, another example of a filtered pixel-level mask,
an initial block-level mask generated by thresholding the
filtered pixel-level mask of FIG. 8, and a filtered block-level
mask generated by applying the morphological filter of
Equation (10) to the initial block-level mask of FIG. 9.

Referring now to FIG. 11, there is shown a block diagram
of a system for performing foreground/background
segmentation, according to a preferred embodiment of the
present invention. Pixel-level thresholder 1102 thresholds
the original image to generate the initial pixel-level mask.
Pixel-level filter 1104 applies the morphological filter of
Equation (8) to the initial pixel-level mask to generate the
filtered pixel-level mask. Block-level thresholder 1106
thresholds the filtered pixel-level mask to generate the initial
block-level mask. Block-level filter 1108 applies the
morphological filter of Equation (10) to the initial block-level
mask to generate the filtered block-level mask.

Referring now to FIG. 12, there is shown a flow diagram
of the processing implemented by the foreground/background
segmentation system of FIG. 11, according to a
preferred embodiment of the present invention. A threshold
value is selected for each component plane of the current
image (step 1202 of FIG. 12). The selected threshold values
are then used to threshold the interframe differences for each
component plane (step 1204). The initial pixel-level mask is
then generated by ORing the thresholded planes together
(step 1206). Under this embodiment, a pixel will be
designated as being part of the foreground in the initial pixel-level
mask if any of the interframe differences for its components
exceed the corresponding threshold value. The morphological
filter of Equation (8) is then applied to the initial
pixel-level mask to generate the filtered pixel-level mask
(step 1208).

The filtered pixel-level mask is then thresholded to
generate the initial block-level mask (step 1210). The
morphological filter of Equation (10) is then applied to the initial
block-level mask to generate the filtered block-level mask
(step 1212).

The present invention can be embodied in the form of
computer-implemented processes and apparatuses for
practicing those processes. The present invention can also be
embodied in the form of computer program code embodied
in tangible media, such as floppy diskettes, CD-ROMs, hard
drives, or any other computer-readable storage medium,
wherein, when the computer program code is loaded into
and executed by a computer, the computer becomes an
apparatus for practicing the invention. The present invention
can also be embodied in the form of computer program

According to Equation (10), if an element of the initial co de, for example, whether stored in a storage medium, 

block-level mask M 6 is 1, or if either of the two correspond- 55 loaded inl0 and/or executed by a computer, or transmitted 

ug matrix products is two or more, then the corresponding over some transmission medium> such ^ over electrical 

element of the filtered block-level mask M b » set to 1 to wifi or ^ throu ^ fiber { Qr yia elect 

a*? 15 P t t Si ^T^t ,° th r iSe i: netic radiation > wherein > " hen the computer program code is 

the corresponding element in the filtered block-level mask ijj-. j \ jl . . 

M' fc is set to 0 to indicate that the pixel is part of the 60 [° adcd mt0 and cxc<n ? c d by a , com P utcr ' *c computer 

background. The morphological operation of Equation (10) beCOmCS a0 a PP aratus for Pricing the invention. 

tends to close small holes in the foreground. The filtered u wil1 be Artier understood that various changes in the 

block-level mask indicates which blocks of the original details, materials, and arrangements of the parts which have 

image are part of the foreground and which are part of the been described and illustrated in order to explain the nature 

background. This information can then be used to determine 65 of this invention may be made by those skilled in the art 

how to distribute the processing resources (e.g., computation without departing from the principle and scope of the 

time and bitrate) to encode the blocks of the current image. invention as expressed in the following claims. 
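
By way of illustration only, the block-level thresholding of the filtered pixel-level mask might be sketched in Python as follows. The function name `block_level_mask`, the list-of-lists mask representation, and the requirement that the mask dimensions be exact multiples of the block size are assumptions of this sketch and are not taken from the specification.

```python
def block_level_mask(pixel_mask, n, threshold):
    """Fold a binary pixel-level mask into a binary block-level mask.

    pixel_mask: 2-D list of 0/1 values (the filtered pixel-level mask).
    n:          block size; each block covers n x n mask elements.
    threshold:  a block is foreground (1) only when its count of
                foreground elements is strictly greater than this value.
    Assumption: both mask dimensions are exact multiples of n.
    """
    rows, cols = len(pixel_mask), len(pixel_mask[0])
    block_mask = []
    for block_row in range(0, rows, n):
        out_row = []
        for block_col in range(0, cols, n):
            # Add up the foreground elements inside this block.
            count = sum(
                pixel_mask[r][c]
                for r in range(block_row, block_row + n)
                for c in range(block_col, block_col + n)
            )
            out_row.append(1 if count > threshold else 0)
        block_mask.append(out_row)
    return block_mask


filtered_pixel_mask = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
# 2 x 2 blocks with a threshold of 1, which lies between N/4 and N for N = 2.
print(block_level_mask(filtered_pixel_mask, n=2, threshold=1))
```

In this example only the upper-left block, which holds three foreground elements, is marked as foreground; the lower-right block holds a single foreground element and does not exceed the threshold.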



What is claimed is:

1. A computer-implemented process for encoding images,
comprising the steps of:

(a) providing interframe differences between pixels of an
original image and a set of reference pixels and thresholding
the interframe differences to generate an initial
pixel-level mask for the original image comprising a
plurality of pixel-level mask elements, each indicating
one of a background and a non-background status for a
corresponding pixel of the original image;

(b) using a first morphological filter to filter out isolated
elements in the initial pixel-level mask to generate a
filtered pixel-level mask;

(c) thresholding blocks of mask elements of the filtered
pixel-level mask to generate an initial block-level mask
comprising a plurality of block-level mask elements,
each indicating one of the background and the
non-background status for a corresponding block of pixels
of the original image, said thresholding comprising the
step of setting each element of the initial block-level
mask to indicate the non-background status only if a
number of pixel-level mask elements in a corresponding
block of the filtered pixel-level mask exceeds a
specified threshold value, and otherwise setting said
each element of the initial block-level mask to indicate
the background status;

(d) using a second morphological filter to filter out
isolated elements in the initial block-level mask to
generate a filtered block-level mask; and

(e) encoding the image based on the filtered block-level
mask.

2. The process of claim 1, wherein step (a) comprises the
steps of:

(1) selecting a threshold value for each component plane
of the original image;

(2) thresholding interframe difference for each component
plane based on the corresponding selected threshold
value; and

(3) generating the initial pixel-level mask based on the
thresholded component planes.
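
As a non-limiting illustration, the per-plane thresholding and combination of steps (1) through (3) might be sketched in Python as follows. The function name `initial_pixel_mask`, the use of an absolute difference as the interframe difference, and the sample values are assumptions of this sketch; the planes are combined by ORing, as in the embodiment of FIG. 12.

```python
def initial_pixel_mask(current_planes, reference_planes, thresholds):
    """Threshold each component plane, then OR the results together.

    current_planes / reference_planes: lists of equally sized 2-D lists,
    one per component plane (for example Y, U and V; an assumption).
    thresholds: one threshold value per component plane.
    """
    rows, cols = len(current_planes[0]), len(current_planes[0][0])
    mask = [[0] * cols for _ in range(rows)]
    for current, reference, threshold in zip(
        current_planes, reference_planes, thresholds
    ):
        for r in range(rows):
            for c in range(cols):
                # Assumption: the interframe difference is an absolute difference.
                if abs(current[r][c] - reference[r][c]) > threshold:
                    mask[r][c] = 1  # ORing: any one plane can mark the pixel
    return mask


# Two hypothetical component planes of a 2 x 2 image.
current = [[[100, 50], [30, 30]], [[128, 140], [128, 128]]]
reference = [[[90, 50], [30, 10]], [[128, 128], [128, 128]]]
print(initial_pixel_mask(current, reference, thresholds=[15, 8]))
```

Here the lower-right pixel is marked by the first plane and the upper-right pixel by the second plane, so both appear as foreground in the combined mask.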

3. The process of claim 1, wherein the set of reference
pixels comprises a reference block for each block of the
original image, where each reference block for each block is
generated by averaging corresponding pixels for
corresponding blocks from a plurality of previous frames.
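
Purely as an illustrative sketch, the averaging of corresponding pixels from a plurality of previous frames might look like the following in Python. The function name `average_reference` and the treatment of "corresponding" pixels as co-located pixels (with no motion compensation) are assumptions of this sketch, not statements of the claimed method.

```python
def average_reference(previous_frames):
    """Average co-located pixels over several previous frames.

    previous_frames: non-empty list of equally sized 2-D lists of pixel values.
    Returns a 2-D list of per-pixel means usable as reference pixels.
    Assumption: corresponding pixels are simply the co-located ones.
    """
    count = len(previous_frames)
    rows, cols = len(previous_frames[0]), len(previous_frames[0][0])
    return [
        [sum(frame[r][c] for frame in previous_frames) / count for c in range(cols)]
        for r in range(rows)
    ]


frames = [
    [[10, 20], [30, 40]],
    [[20, 20], [30, 60]],
]
print(average_reference(frames))
```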

4. The process of claim 3, wherein the first morphological
filter is defined by:

$$M'_p = [M_p * h] \ge 4$$

wherein:

$M_p$ is the initial pixel-level mask;
$M'_p$ is the filtered pixel-level mask;
"*" denotes convolution; and

$$h = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 2 & 1 \\ 0 & 1 & 0 \end{bmatrix}.$$

5. The process of claim 1, wherein the second morphological
filter is defined by:

$$M'_b = M_b \cup [(M_b * h_v) \ge 2] \cup [(M_b * h_h) \ge 2]$$

wherein:

$M_b$ is the initial block-level mask;
$M'_b$ is the filtered block-level mask;
$\cup$ designates an OR operation;
"*" denotes convolution;

$$h_v = \begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix};$$

and

$$h_h = \begin{bmatrix} 1 & 2 & 1 \end{bmatrix}.$$
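
For illustration only, a block-level filter of this form (an element is set when it is already 1, or when its vertical or horizontal [1 2 1] weighted sum is two or more) might be sketched in Python as follows. The function name `close_block_holes` and the treatment of elements outside the mask as 0 are assumptions of this sketch.

```python
def close_block_holes(block_mask):
    """OR the mask with its thresholded vertical and horizontal [1 2 1] sums.

    block_mask: 2-D list of 0/1 values (the initial block-level mask).
    Assumption: elements outside the mask are treated as 0.
    """
    rows, cols = len(block_mask), len(block_mask[0])

    def at(r, c):
        # Zero padding outside the mask boundary (an assumption).
        return block_mask[r][c] if 0 <= r < rows and 0 <= c < cols else 0

    filtered = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            vertical = at(r - 1, c) + 2 * at(r, c) + at(r + 1, c)
            horizontal = at(r, c - 1) + 2 * at(r, c) + at(r, c + 1)
            keep = at(r, c) == 1 or vertical >= 2 or horizontal >= 2
            out_row.append(1 if keep else 0)
        filtered.append(out_row)
    return filtered


initial_block_mask = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
]
print(close_block_holes(initial_block_mask))
```

The single-element hole in the center is filled because both of its vertical neighbors (and both of its horizontal neighbors) are foreground, whereas a background element with only one foreground neighbor is left unchanged.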

6. The process of claim 5, wherein:

step (a) comprises the steps of:

(1) selecting a threshold value for each component plane
of the original image;

(2) thresholding interframe difference for each component
plane based on the corresponding selected threshold
value; and

(3) generating the initial pixel-level mask based on the
thresholded component planes;

the first morphological filter is defined by:

$$M'_p = [M_p * h] \ge 4$$

wherein:

$M_p$ is the initial pixel-level mask;
$M'_p$ is the filtered pixel-level mask;
"*" denotes convolution; and

$$h = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 2 & 1 \\ 0 & 1 & 0 \end{bmatrix}.$$



7. An apparatus for encoding images, comprising:

(a) means for providing interframe differences between
pixels of an original image and a set of reference pixels
and for thresholding the interframe differences to
generate an initial pixel-level mask for the original image
comprising a plurality of pixel-level mask elements,
each indicating one of a background and a
non-background status for a corresponding pixel of the
original image;

(b) a first morphological filter for filtering out isolated
elements in the initial pixel-level mask to generate a
filtered pixel-level mask;

(c) means for thresholding blocks of mask elements of the
filtered pixel-level mask to generate an initial
block-level mask comprising a plurality of block-level mask
elements, each indicating one of the background and
the non-background status for a corresponding block of
pixels of the original image, said means for thresholding
comprising means for setting each element of the
initial block-level mask to indicate the foreground
status only if a number of pixel-level mask elements in
a corresponding block of the filtered pixel-level mask
exceeds a specified threshold value, and for setting said
each element of the initial block-level mask to indicate
the background status otherwise;

(d) a second morphological filter for filtering out isolated
elements in the initial block-level mask to generate a
filtered block-level mask; and

(e) means for encoding the image based on the filtered
block-level mask.

8. The apparatus of claim 7, wherein means (a):

(1) selects a threshold value for each component plane of
the original image;

(2) thresholds interframe difference for each component
plane based on the corresponding selected threshold
value; and

(3) generates the initial pixel-level mask based on the
thresholded component planes.

9. The apparatus of claim 7, wherein the set of reference
pixels comprises a reference block for each block of the
original image, where each reference block for each block is
generated by averaging corresponding pixels for
corresponding blocks from a plurality of previous frames.

10. The apparatus of claim 9, wherein the first morphological
filter is defined by:

$$M'_p = [M_p * h] \ge 4$$

wherein:

$M_p$ is the initial pixel-level mask;
$M'_p$ is the filtered pixel-level mask;
"*" denotes convolution; and

$$h = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 2 & 1 \\ 0 & 1 & 0 \end{bmatrix}.$$

11. The apparatus of claim 7, wherein the second morphological
filter is defined by:

$$M'_b = M_b \cup [(M_b * h_v) \ge 2] \cup [(M_b * h_h) \ge 2]$$

wherein:

$M_b$ is the initial block-level mask;
$M'_b$ is the filtered block-level mask;
$\cup$ designates an OR operation;
"*" denotes convolution;

$$h_v = \begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix};$$

and

$$h_h = \begin{bmatrix} 1 & 2 & 1 \end{bmatrix}.$$

12. The apparatus of claim 11, wherein:

means (a):

(1) selects a threshold value for each component plane of
the original image;

(2) thresholds interframe difference for each component
plane based on the corresponding selected threshold
value; and

(3) generates the initial pixel-level mask based on the
thresholded component planes;

the first morphological filter is defined by:

$$M'_p = [M_p * h] \ge 4$$

wherein:

$M_p$ is the initial pixel-level mask;
$M'_p$ is the filtered pixel-level mask;
"*" denotes convolution; and

$$h = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 2 & 1 \\ 0 & 1 & 0 \end{bmatrix}.$$

13. A storage medium having stored thereon a plurality of
instructions for encoding images, wherein the plurality of
instructions, when executed by a processor, cause the
processor to perform the steps of:

(a) providing interframe differences between pixels of an
original image and a set of reference pixels and thresholding
the interframe differences to generate an initial
pixel-level mask for the original image comprising a
plurality of pixel-level mask elements, each indicating
one of a background and a non-background status for a
corresponding pixel of the original image;

(b) using a first morphological filter to filter out isolated
elements in the initial pixel-level mask to generate a
filtered pixel-level mask;

(c) thresholding blocks of mask elements of the filtered
pixel-level mask to generate an initial block-level mask
comprising a plurality of block-level mask elements,
each indicating one of the background and the
non-background status for a corresponding block of pixels
of the original image, said thresholding comprising the
step of setting each element of the initial block-level
mask to indicate the non-background status only if a
number of pixel-level mask elements in a corresponding
block of the filtered pixel-level mask exceeds a
specified threshold value, and otherwise setting said
each element of the initial block-level mask to indicate
the background status;

(d) using a second morphological filter to filter out
isolated elements in the initial block-level mask to
generate a filtered block-level mask; and

(e) encoding the image based on the filtered block-level
mask.

14. The storage medium of claim 13, wherein step (a)
comprises the steps of:

(1) selecting a threshold value for each component plane
of the original image;

(2) thresholding interframe difference for each component
plane based on the corresponding selected threshold
value; and

(3) generating the initial pixel-level mask based on the
thresholded component planes.

15. The storage medium of claim 13, wherein the set of 
reference pixels comprises a reference block for each block 
of the original image, where each reference block for each 
block is generated by averaging corresponding pixels for 
corresponding blocks from a plurality of previous frames. 

16. The storage medium of claim 15, wherein the first
morphological filter is defined by:

$$M'_p = [M_p * h] \ge 4$$

wherein:

$M_p$ is the initial pixel-level mask;
$M'_p$ is the filtered pixel-level mask;
"*" denotes convolution; and

$$h = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 2 & 1 \\ 0 & 1 & 0 \end{bmatrix}.$$
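
As a non-limiting sketch, a pixel-level filter that convolves the mask with the 3 x 3 kernel having rows (0 1 0), (1 2 1), (0 1 0) and keeps only elements whose result is at least 4 might be written in Python as follows. The function name `remove_isolated_pixels` and the treatment of elements outside the mask as 0 are assumptions of this sketch. Because the kernel is symmetric, the weighted neighborhood sum below is equivalent to the convolution.

```python
KERNEL = [
    [0, 1, 0],
    [1, 2, 1],
    [0, 1, 0],
]


def remove_isolated_pixels(pixel_mask):
    """Keep a mask element only if its kernel-weighted neighborhood sum is >= 4.

    pixel_mask: 2-D list of 0/1 values (the initial pixel-level mask).
    Assumption: elements outside the mask are treated as 0.
    """
    rows, cols = len(pixel_mask), len(pixel_mask[0])

    def at(r, c):
        # Zero padding outside the mask boundary (an assumption).
        return pixel_mask[r][c] if 0 <= r < rows and 0 <= c < cols else 0

    filtered = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            total = sum(
                KERNEL[dr + 1][dc + 1] * at(r + dr, c + dc)
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
            )
            out_row.append(1 if total >= 4 else 0)
        filtered.append(out_row)
    return filtered


initial_mask = [
    [1, 0, 0],
    [0, 0, 0],
    [0, 1, 1],
    [0, 1, 1],
]
print(remove_isolated_pixels(initial_mask))
```

In this example the isolated element in the upper-left corner is removed, while each element of the 2 x 2 cluster has two set neighbors and therefore survives.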



17. The storage medium of claim 13, wherein the second
morphological filter is defined by:

$$M'_b = M_b \cup [(M_b * h_v) \ge 2] \cup [(M_b * h_h) \ge 2]$$

wherein:

$M_b$ is the initial block-level mask;
$M'_b$ is the filtered block-level mask;
$\cup$ designates an OR operation;
"*" denotes convolution;

$$h_v = \begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix};$$

and

$$h_h = \begin{bmatrix} 1 & 2 & 1 \end{bmatrix}.$$

18. The storage medium of claim 17, wherein:

step (a) comprises the steps of:

(1) selecting a threshold value for each component plane
of the original image;

(2) thresholding interframe difference for each component
plane based on the corresponding selected threshold
value; and

(3) generating the initial pixel-level mask based on the
thresholded component planes;

the first morphological filter is defined by:

$$M'_p = [M_p * h] \ge 4$$

wherein:

$M_p$ is the initial pixel-level mask;
$M'_p$ is the filtered pixel-level mask;
"*" denotes convolution; and

$$h = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 2 & 1 \\ 0 & 1 & 0 \end{bmatrix}.$$